more scaffolding updates #511

dpark01 · 2024-02-13T15:03:59Z

scaffold_and_refine_multitaxa:

more resilience to empty outputs or partial recovery of multi-segment genomes
update input reference genome data structure to tsv instead of json string
auto-filter input reference genome list (optionally) based on kraken hits
run polish only if scaffolding is successful; if not, fall back to refbased assembly as a hail mary attempt
more top-level (terra table) outputs

classify_single:

more top-level workflow outputs summarizing primary viral taxa found, and how much

containers:

bump viral-classify 2.2.3.0 to 2.2.4.0

build:

update cromwell/womtool version

tomkinsc · 2024-02-14T21:40:15Z

pipes/WDL/workflows/scaffold_and_refine_multitaxa.wdl

@@ -79,7 +70,7 @@ workflow scaffold_and_refine_multitaxa {
            "assembly_length_unambiguous" : refine.assembly_length_unambiguous,
            "reads_aligned" : refine.align_to_self_merged_reads_aligned,
            "mean_coverage" : refine.align_to_self_merged_mean_coverage,
-            "percent_reference_covered" : 1.0 * refine.assembly_length_unambiguous / refine.reference_genome_length,
+            "percent_reference_covered" : select_first([percent_reference_covered, 0.0]),


It may be nice to break out tax_name and percent_reference_covered for the "top" viral assembly into separate workflow outputs, for easier search and filtering on Terra (where "top" could be defined as the most complete assembly, or the most abundant taxon in terms of # of reads or # of matching distinct k-mers).

dpark01 replied Feb 15, 2024

Added to the TODO comments at the bottom of the WDL. I think this will require a small bespoke TSV-parsing task. It will also need to be resilient to the empty-output scenario (i.e., there is no top assembly because none were attempted or none were successful).
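A minimal Python sketch of what such a TSV-parsing step might do inside a task, under stated assumptions. The column names `tax_name` and `percent_reference_covered` come from the discussion above. The TSV layout, the function name `pick_top_assembly`, and ranking by "most complete" are hypothetical. Empty input returns defaults instead of raising an error, to cover the empty-output scenario.

```python
import csv
import io


def _to_float(value):
    """Convert a TSV cell to float, treating missing or malformed cells as 0.0."""
    try:
        return float(value or 0.0)
    except ValueError:
        return 0.0


def pick_top_assembly(tsv_text, rank_by="percent_reference_covered"):
    """Return (tax_name, percent_reference_covered) for the highest-ranked row.

    Hypothetical helper: assumes a header row with `tax_name` and
    `percent_reference_covered` columns. Returns ("", 0.0) when the table
    has no data rows (nothing attempted, or nothing succeeded).
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ("", 0.0)
    top = max(rows, key=lambda row: _to_float(row.get(rank_by)))
    return (top.get("tax_name") or "", _to_float(top.get("percent_reference_covered")))
```

Ranking by read count or by distinct k-mers instead would only require passing a different `rank_by` column, assuming the table carries one.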

tomkinsc · 2024-02-15T18:04:54Z

pipes/WDL/workflows/scaffold_and_refine_multitaxa.wdl

-
-        Int    num_read_groups                       = refine.num_read_groups[0]
-        Int    num_libraries                         = refine.num_libraries[0]
+        Array[Map[String,String]] assembly_stats_by_taxon  = stats_by_taxon


Any reason we can't make this type Map[ String, Map[String,String] ], where the outer map String keys are the taxid or tax_name values? (for picking out values for a given taxon in downstream analyses)

dpark01 replied Feb 15, 2024

Mostly just because of how we construct it (see the scatter in the WDL above), and because WDL 1.0 lacks many of the basic methods for navigating Maps and converting back and forth between Maps and Arrays.
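For downstream analyses outside WDL, the keyed lookup can be rebuilt from the array output. A minimal Python sketch, assuming `assembly_stats_by_taxon` has been serialized as a JSON list of string-to-string objects and that each object carries a `taxid` key. Both are assumptions, and the function name `index_by_taxon` is hypothetical.

```python
import json


def index_by_taxon(stats_json, key="taxid"):
    """Turn a JSON list of per-taxon stat maps into a dict keyed by `key`.

    Hypothetical helper: records lacking the key are skipped, and if two
    records share a key the later one replaces the earlier one.
    """
    indexed = {}
    for record in json.loads(stats_json):
        value = record.get(key)
        if value is None:
            continue  # no usable key for this record
        indexed[value] = record
    return indexed
```

Passing `key="tax_name"` gives the other keying suggested in the review comment.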

dpark01 added 14 commits February 5, 2024 16:22

defend against rather common empty output scenario

7064708

more compliant wdl

0593083

add new wdl task report_primary_kraken_taxa

0bedc96

add report_primary_kraken_taxa wdl task and add to classify_single

98f9bbd

add a few more outputs

e39919b

Merge remote-tracking branch 'origin/master' into dp-scaffold

a85d7c9

try wdl 1.1 and see what happens

9e12088

try wdl development and see what happens

02cf671

update to take tsv instead of json input for reference/tax map

d824518

attempt to not fail in scaffolding when some but not all segments of multi-segment genome are recovered

fa07252

forgot $

031a294

remove random empty newline introduced in this branch

8a9b26f

fix bash logical construction

165eb66

Merge remote-tracking branch 'origin/master' into dp-scaffold

8c898c9

tomkinsc reviewed Feb 14, 2024

dpark01 added 3 commits February 14, 2024 16:40

initial draft of task for filtering reference list

1080d49

pre-extract taxdump tarball

1a77bf7

add optional kraken-based reference selection to multitaxa

d31c14a

tomkinsc reviewed Feb 15, 2024

dpark01 added 7 commits February 16, 2024 10:18

why cromwell do you behave poorly on edge cases

526cece

more stats and outputs, revert to refbased if cant denovo, dont polish if cant denovo

f02a58b

Merge remote-tracking branch 'origin/master' into dp-scaffold

ca24b2d

simplify cromwell fix

bc6bee7

Merge branch 'master' into dp-scaffold

6a71e1a

bump viral-classify 2.2.3.0 to 2.2.4.1

93d455f

revert version

88ca4d1

dpark01 marked this pull request as ready for review February 16, 2024 23:08

dpark01 enabled auto-merge February 17, 2024 00:01

dpark01 disabled auto-merge February 17, 2024 00:02

dpark01 merged commit 847d661 into master Feb 17, 2024
12 checks passed

dpark01 deleted the dp-scaffold branch March 1, 2024 13:53
